Mining atomic Chinese abbreviations with a probabilistic single character recovery model
نویسندگان
چکیده
An HMM-based single character recovery (SCR) model is proposed in this paper to extract a large set of atomic abbreviations and their full forms from a text corpus. By an ‘‘atomic abbreviation,’’ it refers to an abbreviated word consisting of a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaustively but the abbreviation process for compound words seems to be compositional. One can often decode an abbreviated word character by character to its full form. With a large atomic abbreviation dictionary, one may be able to handle multiple character abbreviation problems more easily based on the compositional property of abbreviations.
منابع مشابه
Mining Atomic Chinese Abbreviation Pairs with a Probabilistic Single Character Word Recovery Model
An HMM-based Single Character Recovery (SCR) Model is proposed in this paper to extract a large set of “atomic abbreviation pairs”from a text corpus. By an “atomic abbreviation pair,”it refers to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character. This task is important since Chinese abbreviations cannot be enumerated exhaust...
متن کاملMining Atomic Chinese Abbreviation Pairs: A Probabilistic Model for Single Character Word Recovery
An HMM-based Single Character Recovery (SCR) Model is proposed in this paper to extract a large set of “atomic abbreviation pairs”from a large text corpus. By an “atomic abbreviation pair,”it refers to an abbreviated word and its root word (i.e., unabbreviated form) in which the abbreviation is a single Chinese character. This task is interesting since the abbreviation process for Chinese compo...
متن کاملA Preliminary Study on Probabilistic Models for Chinese Abbreviations
Chinese abbreviations are widely used in the modern Chinese texts. They are a special form of unknown words, including many named entities. This results in difficulty for correct Chinese processing. In this study, the Chinese abbreviation problem is regarded as an error recovery problem in which the suspect root words are the “errors” to be recovered from a set of candidates. Such a problem is ...
متن کاملThe use of probabilistic lexicality cues for word segmentation in Chinese reading.
In an eye-tracking experiment we examined whether Chinese readers were sensitive to information concerning how often a Chinese character appears as a single-character word versus the first character in a two-character word, and whether readers use this information to segment words and adjust the amount of parafoveal processing of subsequent characters during reading. Participants read sentences...
متن کاملA Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis
Basically, medical diagnosis problems are the most effective component of treatment policies. Recently, significant advances have been formed in medical diagnosis fields using data mining techniques. Data mining or Knowledge Discovery is searching large databases to discover patterns and evaluate the probability of next occurrences. In this paper, Bayesian Classifier is used as a Non-linear dat...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Language Resources and Evaluation
دوره 40 شماره
صفحات -
تاریخ انتشار 2006